In [1]:
from IPython.display import HTML, display, Image
HTML('''
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/2.0.3/jquery.min.js"></script><script>code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.jp-CodeCell > div.jp-Cell-inputWrapper').hide();
 } else {
$('div.jp-CodeCell > div.jp-Cell-inputWrapper').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script><form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')
Out[1]:

Term 3 LT9 | Christine Albao, William Delfin, Felicismo Lazaro III, Ian Lucas, Loraine Menorca


I. Executive Summary

 With the rise of social media in recent years, different industries have taken notice and explored how they can harness these platforms to improve and expand their businesses [10]. One of these industries is marketing, which has undergone a revolution with the rise of social media. Individuals no longer need to spend large amounts of money on advertising to get their products or services known [9]. Facebook, TikTok, Twitter, and Instagram are a few examples of social media platforms now used by different businesses and industries in their marketing operations.

 Along with the social media revolution in marketing came the rise of influencers and influencer marketing. Influencers are social media personalities or entities known to be knowledgeable or entertaining about a certain topic. They grow their communities by gaining followers on social media platforms. With a big enough following, they gain the power to influence buying habits and decisions, as well as trends in different products, within their community. In the Philippines, influencer marketing is especially effective because the country is the second most active in terms of social media consumption worldwide.

 In this project, we explored how clustering and recommender systems can be used to suggest which influencers a brand can tap for potential partnerships and advertising deals on Twitter. As a specific example, we used the brand Dove and its Twitter account as a brand in need of influencer recommendations. We then used a Twitter database compiled by scraping the tweets of Dove, its followers, and other major Filipino influencers such as Aldub for the clustering and recommender systems.

 After the standard data cleaning, exploration, and preprocessing, we performed clustering on the data points and identified three main clusters, namely, News and Entertainment Outlets, Micro and Macro Influencers, and Celebrities. For the recommender system, we used a content-based approach, which requires the creation of item and user profiles. For the item profiles, we used the accounts as rows and their tweet keywords as features. The user profile, on the other hand, was aggregated using the number of followers and tweets. The clustering results and the recommendations of the recommender system identified key personalities such as Tirso Cruz and Liza Soberano, to name a few, as ideal candidates for Dove to partner with. These personalities also align with the values that Dove stands for, such as empowerment and self-love.


II. List of Tables and Figures

Table 1. Influencer Profile (df_partners) - Data Dictionary
Table 2. Influencer Tweets (df_partner_tweets) - Data Dictionary


Figure 1. Data Source Overview
Figure 2. Distribution of Profile Metrics
Figure 3. Distribution of Followers and Tweet Count
Figure 4. Distribution of Tweet Metrics
Figure 5. Distribution of Account-Following count […]
Figure 6. The Project Methodology
Figure 7. The Model Pipeline
Figure 8. Clustering Internal Validation Criteria
Figure 9. Final k-medoids clustering retaining only 3 clusters
Figure 10. Cluster 1: News/Entertainment Outlets, 54 Twitter accounts
Figure 11. Cluster 2: Celebrities, 194 Twitter accounts
Figure 12. Cluster 3: Social Media Macro & Micro Influencers, 659 Twitter accounts
Figure 13. Tweets of Recommended News/Entertainment Outlets
Figure 14. Tweets of Recommended Celebrities
Figure 15. Tweets of Recommended Social Media Micro- & Macro-Influencers
Figure 16. Recommender System Performance


III. Problem Statement

     How can brands know which influencers best match their image and values, and would make good partners for their marketing campaigns? Specifically, this project aims:
- To explore the application of clustering and recommender systems in searching for potential partner influencers for a brand.
- To identify personalities that are ideal candidates for the brand Dove to partner with.

IV. Motivation

 The power of social media is self-evident -- it builds relationships, shares experiences, and even educates people to a great extent. In marketing, however, social media presence has become essential. It allows brands to reach people and potential clients in ways it has been unable to do pre-social media. While brands and companies can run ads on various platforms through their business accounts, another arguably more effective and powerful avenue is the utilization of influencer marketing. According to the Influencer Marketing Hub: "Influencer marketing involves a brand collaborating with an online influencer to market one of its products or services." [11]

 The effectiveness of influencer marketing is no longer up for debate. However, choosing who to partner with can be a significantly involved and relatively expensive process -- not only do companies have to consider the influencers' reach, but more importantly, they have to consider the personality's values and how it aligns with theirs. A typical way for brands to reach influencers is through ad agencies. Agencies usually have a pool of influencers they can tap to partner with clients, and while this may greatly hasten the shortlisting process, agency fees are significant. High agency fees are why some brands choose to contact influencers directly instead. The obvious downside is the time and effort involved in scouting and communicating with potential partners.

 Machine learning and information retrieval techniques help alleviate the tedious process of choosing who to partner with and can significantly improve the quality of the outcome. Using a brand's current social network and aggregating networks of the most influential personalities can provide a list of potential celebrity and influencer (micro and macro) partners -- greatly streamlining the process and likely cutting on expenses.


V. Data Source

The necessary details of the scraping process are covered in this section. A separate notebook, lt9_dmw2_finalproject_scraping.ipynb, which lays out the whole process, is also provided for more information.


Figure 1. Data Source Overview.
The profiles of the brand and influencers were scraped via Twitter API. Similarly, the tweets of the influencers were collected.


In [2]:
import numpy as np
import pandas as pd
import sqlite3

# Plotting tools
import seaborn as sns
import matplotlib.pyplot as plt
import plotly 
import plotly.graph_objects as go
import plotly.express as px

# NLP tools
from collections import Counter
from tqdm import tqdm, trange
import re
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from PIL import Image
from IPython.display import display, HTML

# Clustering
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA, TruncatedSVD
from scipy.spatial.distance import euclidean, cityblock
from sklearn.base import clone
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.cluster import OPTICS, cluster_optics_dbscan
from sklearn.neighbors import NearestNeighbors
from pyclustering.cluster.kmedoids import kmedoids
from scipy.cluster.hierarchy import dendrogram, fcluster

from sklearn.metrics import (calinski_harabasz_score,
                             silhouette_score,
                             davies_bouldin_score)

# Recommentation System
from sklearn.metrics import dcg_score, ndcg_score
from scipy.spatial.distance import cosine

np.random.seed(143)
randstate = 143

# set global plotting parameters
custom_sns_params = {'lines.linewidth': 2, 'font.size': 12,
                     'axes.titlesize': 14, 'axes.labelsize': 12,
                     'xtick.labelsize': 12, 'ytick.labelsize': 12,
                     'legend.fontsize': 12, 'legend.fancybox': True}
sns.set_theme('notebook', style='ticks', rc=custom_sns_params)
colors = ['#003B7F', '#EDC254']
custom_palette = sns.blend_palette(colors) # , n_colors=5
sns.set_palette(custom_palette)

# define a global parameter figure counter
fig_n = 2
def fig_count():
    global fig_n
    fig_n += 1
    return fig_n
[nltk_data] Downloading package punkt to
[nltk_data]     /home/msds2023/mmenorca/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/msds2023/mmenorca/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /home/msds2023/mmenorca/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!

V.A. Influencers and their Tweets

Influencers

 We considered the users that an account follows to be its social network, and defined influencers as users who have **at least 50,000** followers. In this work, three social networks were considered as the pool of potential partners that a brand can be matched with: Anne Curtis's, Alden Richards's, and the brand's own. Overall, 907 influencers were collected.

 Since Anne Curtis and Alden Richards were among the top 10 most followed Twitter users in the Philippines as of 2016 [1], it was assumed that their social networks could represent a good sample of influencers in the country. The brand's own social circle, on the other hand, was added to the pool of potential partners on the assumption that it represents influencers the brand already considers a good match.

 The `https://api.twitter.com/2/users/:id/following` endpoint was used to get the three social networks that make up the pool of 907 influencers. This query returns the profile of each user, whose contents are described briefly in Table 1.
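For reference, a minimal sketch of how this endpoint might be queried using only the standard library. The `max_results` value and the chosen `user.fields` are illustrative, the bearer token is a placeholder, and the network call itself is not executed here:

```python
import urllib.request
import json

API_BASE = "https://api.twitter.com/2"

def build_following_url(user_id, max_results=1000):
    # The :id path segment is filled in with the numeric user id;
    # the query parameters shown are illustrative choices
    return (f"{API_BASE}/users/{user_id}/following"
            f"?max_results={max_results}"
            f"&user.fields=description,public_metrics")

def fetch_following(user_id, bearer_token):
    # An actual call requires valid Twitter API credentials
    req = urllib.request.Request(
        build_following_url(user_id),
        headers={"Authorization": f"Bearer {bearer_token}"})
    with urllib.request.urlopen(req) as resp:  # network call, not run here
        return json.load(resp)
```

In practice, the results are paginated, so the scraping notebook would loop over pagination tokens until the full social network is collected.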


Table 1. Influencer Profile (df_partners) - Data Dictionary
Feature Data Type Description
id string unique identifier for each twitter user
description string twitter bio description
created_at datetime date of account creation
username string twitter user username
protected integer indicates if twitter account is protected or not
name string real or formal name of twitter user
url string url link of twitter account
location string address of twitter user
followers_count integer indicates the number of followers the user has
tweet_count integer indicates the number of tweets the user has posted
listed_count integer indicates the number of lists the user is in
included integer indicates if twitter user is included into the consideration set
rating integer indicates the rating of twitter user

Tweets

 The 100 most recent tweets of each influencer as of 05 March 2023 were collected. The content of each influencer's tweets was assumed to reflect their ideals, values, and tone. This was used as the basis for evaluating whether an influencer is a good match for the brand. Overall, 88,135 tweets were gathered.

 The `https://api.twitter.com/2/users/:id/tweets` endpoint was used to get the tweets of each influencer. The contents of the scraped tweets are described briefly in Table 2.


Table 2. Influencer Tweets (df_partner_tweets) - Data Dictionary
Feature Data Type Description
lang string indicates the language used for the tweet
id integer unique identifier of the tweet
created_at datetime date of creation of the tweet
possibly_sensitive integer indicates if topic of tweet is sensitive or not
author_id integer unique identifier for the author of the tweet
conversation_id integer unique identifier for the specific tweet and its replies
text string content of the tweet
in_reply_to_user integer indicates if the tweet is a reply to another user id
retweet_count integer indicates the number of times the tweet was shared
reply_count integer indicates the number of replies the tweet has received
like_count integer indicates the number of likes the tweet has received
quote_count integer indicates the number of times the tweet was quoted by another user
impression_count integer indicates the number of impressions the tweet has garnered
In [3]:
# Main database
sqlite_db = 'dmw2_final_project.db'
conn = sqlite3.connect(sqlite_db)


# Load necessary tables
tbl_partners = 'partners'
df_partners  = pd.read_sql(f"SELECT * FROM {tbl_partners}", 
                           parse_dates=['created_at'], con=conn)

tbl_partner_tweets = 'partner_tweets'
df_partner_tweets = pd.read_sql(f"SELECT * FROM {tbl_partner_tweets}",
                                parse_dates=['created_at'], con=conn)
In [4]:
df_partners.head()
Out[4]:
id description created_at username protected name url location followers_count following_count tweet_count listed_count included rating
0 49616273 China's national English language newspaper, u... 2009-06-22 12:41:39+00:00 globaltimesnews 0 Global Times https://t.co/LgROMWT42V Beijing, China 1880407 538 229828 0 1.0 1.0
1 293932241 Always Thankful. ask@vmgasia.co 2011-05-06 07:09:52+00:00 AlyssaValdez2 0 Alyssa Valdez https://t.co/NuWOu0Mt66 None 2258385 608 5304 445 1.0 1.0
2 42335426 Filipina Wife, Mother of 5, Homemaker, Actress... 2009-05-25 02:48:59+00:00 mommymaricel 1 Maricel Laxa-P. http://t.co/T8kqyg478t Manila, Philippines 69898 109 6333 104 1.0 1.0
3 333955253 Writer, moon child, cat mom, fangirl and dream... 2011-07-12 10:38:32+00:00 iamAlyloony 0 Aly 🌑🌸 https://t.co/Y27e5bUvnm PH 340964 178 136503 186 1.0 1.0
4 58155585 Curiouser and curiouser. 2009-07-19 08:20:45+00:00 KianaVee 0 Kiana V https://t.co/1MMICAW9u5 None 90953 844 20016 72 1.0 1.0
In [5]:
df_partner_tweets.head()
Out[5]:
lang id created_at possibly_sensitive author_id conversation_id text in_reply_to_user_id retweet_count reply_count like_count quote_count impression_count
0 en 1632275246328823808 2023-03-05 07:01:59+00:00 0 49616273 1632275246328823808 China's deficit-to-GDP ratio is set at 3 perce... None 0 0 0 0 55
1 en 1632274933089808386 2023-03-05 07:00:44+00:00 0 49616273 1632274933089808386 Fifteen national advisors issued a joint propo... None 0 0 1 0 259
2 en 1632267673399701504 2023-03-05 06:31:54+00:00 0 49616273 1632267673399701504 The number of giant pandas at SW China's Cheng... None 1 0 12 0 2478
3 en 1632258378243391488 2023-03-05 05:54:57+00:00 0 49616273 1632258378243391488 China’s major public hospitals should increase... None 3 2 7 0 3139
4 en 1632245705749450753 2023-03-05 05:04:36+00:00 0 49616273 1632245705749450753 "The Wandering Earth 2 let the audience see a ... None 7 1 9 1 3662

V.B. Brand

 This work considers Dove as the stakeholder who aims to find social media influencers that best match their brand image and values. Moving forward, "Dove" and the "Brand" may be used interchangeably.

 Dove's social network was obtained to determine which users are already "liked" by the brand. The tweets of Dove's social network were used as a basis for the brand's tone, values, and ideals that would be compared with that of the other potential partners in the matching process.


VI. Data Exploration

INSIGHTS Null observations can be seen in the 'url' and 'location' features of the df_partners dataframe. The features are a combination of numeric, categorical, and object types.
In [6]:
df_partners.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 915 entries, 0 to 914
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype              
---  ------           --------------  -----              
 0   id               915 non-null    object             
 1   description      915 non-null    object             
 2   created_at       915 non-null    datetime64[ns, UTC]
 3   username         915 non-null    object             
 4   protected        915 non-null    int64              
 5   name             915 non-null    object             
 6   url              754 non-null    object             
 7   location         687 non-null    object             
 8   followers_count  915 non-null    int64              
 9   following_count  915 non-null    int64              
 10  tweet_count      915 non-null    int64              
 11  listed_count     915 non-null    int64              
 12  included         915 non-null    float64            
 13  rating           915 non-null    float64            
dtypes: datetime64[ns, UTC](1), float64(2), int64(5), object(6)
memory usage: 100.2+ KB
INSIGHTS
- Most partners created their Twitter accounts between 2008 and 2012, so accounts that are 11 to 15 years old dominate the pool. Moreover, most of these accounts are public and only 8 are private; these are the same 8 accounts considered "not included" in this project.
- For the follower counts, the minimum value is 50K while the maximum is 113M, ensuring that partner accounts have high audience reach.
- Only 58 accounts are rated by Twitter, i.e., those that have their official Twitter badge.
In [7]:
df_partners.describe()
Out[7]:
protected followers_count following_count tweet_count listed_count included rating
count 915.000000 9.150000e+02 915.000000 9.150000e+02 915.000000 915.0 915.000000
mean 0.007650 3.934655e+06 6184.685246 5.919677e+04 9202.324590 1.0 0.064481
std 0.087178 1.147208e+07 50445.232292 1.253672e+05 28201.544379 0.0 0.245742
min 0.000000 5.021600e+04 0.000000 8.000000e+00 0.000000 1.0 0.000000
25% 0.000000 1.817210e+05 195.000000 8.539500e+03 377.500000 1.0 0.000000
50% 0.000000 6.242060e+05 513.000000 2.013600e+04 1356.000000 1.0 0.000000
75% 0.000000 2.431820e+06 1037.500000 4.703600e+04 5994.000000 1.0 0.000000
max 1.000000 1.134631e+08 853552.000000 1.149782e+06 534074.000000 1.0 1.000000
In [8]:
df_partners.hist(figsize=(20,10))
plt.suptitle(f'Fig. {fig_n}: Distribution of Profile Metrics', fontsize=16);
_ = fig_count()
INSIGHTS Since we are looking for high audience reach and highly active social media accounts, we examine followers_count and tweet_count:
- Follower counts are centered on 3.9M, with the highest account having 113M followers and the lowest 50K.
- Tweet counts, on the other hand, are distributed around 59K, with the least active account having 8 tweets and the most active around 1.1M.
In [9]:
plt.rcParams['figure.figsize'] = (12, 5)

plt.subplot(1, 2, 1)
sns.histplot(df_partners['followers_count'], kde=True)

plt.subplot(1, 2, 2)
sns.histplot(df_partners['tweet_count'], kde=True)

plt.suptitle(f'Fig. {fig_n}: Distribution of Followers and Tweet Count',
             fontsize=16)
plt.show();
_ = fig_count()
INSIGHTS Null observations can be seen only in the 'in_reply_to_user_id' feature of the df_partner_tweets dataframe. The features are a combination of numeric, categorical, and object types.
In [10]:
df_partner_tweets.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88135 entries, 0 to 88134
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype              
---  ------               --------------  -----              
 0   lang                 88135 non-null  object             
 1   id                   88135 non-null  object             
 2   created_at           88135 non-null  datetime64[ns, UTC]
 3   possibly_sensitive   88135 non-null  int64              
 4   author_id            88135 non-null  object             
 5   conversation_id      88135 non-null  object             
 6   text                 88135 non-null  object             
 7   in_reply_to_user_id  16544 non-null  object             
 8   retweet_count        88135 non-null  int64              
 9   reply_count          88135 non-null  int64              
 10  like_count           88135 non-null  int64              
 11  quote_count          88135 non-null  int64              
 12  impression_count     88135 non-null  int64              
dtypes: datetime64[ns, UTC](1), int64(6), object(6)
memory usage: 8.7+ MB
INSIGHTS
- Most tweets were created recently (2022 to 2023). This is because the data scraped consists of the most recent tweets of each user, to consider only the most relevant and up-to-date information.
- For the 'possibly_sensitive' feature, 646 tweets are tagged. A 'possibly_sensitive' tag indicates that a URL contained in the tweet may point to content or media identified as sensitive.
- For the engagement metrics, the retweet, reply, like, and quote counts have mean values of 430, 87, 2.8K, and 86, respectively, indicating high engagement across the pool of influencers considered.
- Lastly, beyond engagement, considering views and clicks, 'impression_count' has an average of 4.8K and a maximum of 111M impressions.
In [11]:
df_partner_tweets.describe()
Out[11]:
possibly_sensitive retweet_count reply_count like_count quote_count impression_count
count 88135.00000 88135.000000 88135.000000 8.813500e+04 88135.000000 8.813500e+04
mean 0.00733 430.759403 87.710773 2.827568e+03 86.941567 4.789302e+04
std 0.08530 4512.823739 924.958186 2.460507e+04 1585.003811 6.603273e+05
min 0.00000 0.000000 0.000000 0.000000e+00 0.000000 0.000000e+00
25% 0.00000 0.000000 0.000000 4.000000e+00 0.000000 0.000000e+00
50% 0.00000 4.000000 1.000000 3.100000e+01 0.000000 0.000000e+00
75% 0.00000 33.000000 9.000000 2.340000e+02 4.000000 5.474500e+03
max 1.00000 349510.000000 74964.000000 1.886537e+06 163748.000000 1.110853e+08
In [12]:
df_partner_tweets.hist(figsize=(20,10))
plt.suptitle(f'Fig. {fig_n}: Distribution of Tweet Metrics',
             fontsize=16);
_ = fig_count()
In [13]:
df_partner_tweets.retweet_count.value_counts()
Out[13]:
0        24629
1         8350
2         5346
3         3740
4         2905
         ...  
6054         1
11281        1
7666         1
2250         1
2006         1
Name: retweet_count, Length: 4075, dtype: int64
INSIGHTS To ensure the diversity of the pool of potential influencers, we made sure that a large number of accounts are not yet followed by the Dove brand PH. From the above, accounts tagged with a value of "1" have a mutual follow-back with the Dove brand PH, while those tagged "0" are the potential influencers that Dove is not yet following; the latter form the pool considered for potential matching.
In [14]:
r_cnt = df_partners['rating'].value_counts()
print(f' No Followback = {r_cnt[0]} \
     \n Follow back = {r_cnt[1]}')
df_partners['rating'].value_counts().plot(kind='bar')
plt.title(f'Fig. {fig_n}: Distribution of Account-Following count.'
          '\n1 indicates a Follow-back, 0 otherwise.', fontsize=16);
_ = fig_count()
 No Followback = 856      
 Follow back = 59

VII. Method


Figure 6. The Project Methodology.
The influencers' profile and tweets were scraped using Twitter API. The profile description (i.e., Bio) and tweets of each were subjected to Text Pre-processing techniques before undergoing Clustering and the Recommendation System. Similarly, the profile of the brand (Dove) was obtained.


In [15]:
_ = fig_count()

VII.A. Text Pre-processing

The tweets were first cleaned by removing links, usernames, and unnecessary characters such as double spaces or extra punctuation, as these were of no use to our research scope.


Keyword Extraction

Bag-of-Words Representation

After cleaning, the tweets were then represented as a bag-of-words vector [2] where each component corresponds to a unique word (token or term) and its value represents the number of times the word occurred in the text.
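To make the representation concrete, here is a minimal bag-of-words sketch using only the standard library; the sample tweets are made up for illustration:

```python
from collections import Counter

# Hypothetical cleaned tweets
docs = [
    "real beauty is self love and self care",
    "new movie premiere tonight love it",
]

# Build the vocabulary: every unique token across all documents
vocab = sorted({tok for doc in docs for tok in doc.split()})

def bow_vector(doc):
    # Each document becomes a vector of raw term counts over the vocabulary
    counts = Counter(doc.split())
    return [counts[term] for term in vocab]

vectors = [bow_vector(doc) for doc in docs]
```

In the notebook itself, `CountVectorizer` from `sklearn.feature_extraction.text` performs this same mapping at scale.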

Lemmatization

 The bag-of-words were then lemmatized to reduce raw words into their lemmas. For this, the `WordNetLemmatizer` function imported from the `nltk.stem` library was applied.

 In linguistics, the process of lemmatization relates to the mechanism of clustering together inflected forms of a word and converting them to their lemma or dictionary-form terms [3]. Instead of stemming, another text-processing technique that modifies words into their root words, lemmatization is more accurate as it algorithmically processes and determines the lemma based on the word’s intended meaning [4, 5].
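As a rough illustration of the difference, a stemmer clips suffixes while a lemmatizer maps inflected forms to dictionary words. The sketch below uses a tiny hand-made lemma table standing in for WordNet, purely for illustration:

```python
# Hypothetical, hand-made lemma table standing in for WordNet lookups
LEMMAS = {"studies": "study", "better": "good", "feet": "foot", "caring": "care"}

def naive_stem(word):
    # Crude suffix stripping, as a plain stemmer might do
    for suffix in ("ies", "ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def lemmatize(word):
    # Dictionary lookup returns the true lemma when known
    return LEMMAS.get(word, word)
```

Note how stemming "studies" yields the non-word "stud", while lemmatization recovers "study"; `WordNetLemmatizer` performs the latter using the WordNet lexical database.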

Term Frequency-Inverse Document Frequency (TF-IDF)

 Given the numerous words gathered after lemmatization, the `TfidfVectorizer` function from the `sklearn.feature_extraction.text` library was used to determine which words are relevant to each text, and to down-weight words that occur frequently across all documents.

 TF-IDF measures the relevance of each word in a document such that the more frequently it appears, the more relevant it is. However, the rarity of the word across documents is also taken into account, to ensure that a word scores highly not because it is a common word, but because it is genuinely relevant [6, 7].
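A bare-bones version of the computation, on made-up documents. This sketch uses the plain textbook formula tf * log(N / df) for clarity; `TfidfVectorizer` applies a smoothed IDF and normalization by default, so its exact values differ:

```python
import math

# Hypothetical documents
docs = [
    "beauty soap for real beauty",
    "new soap launch today",
    "real stories real people",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

def tf_idf(term, doc_tokens):
    # Document frequency: in how many documents the term appears
    df = sum(term in toks for toks in tokenized)
    if df == 0:
        return 0.0
    # Term frequency within the given document
    tf = doc_tokens.count(term) / len(doc_tokens)
    return tf * math.log(N / df)

# "soap" appears in 2 of 3 documents while "beauty" appears in only 1,
# so "beauty" receives the larger IDF boost in document 1.
```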

In [16]:
def remove_links(tweet):
    # Reference: https://ourcodingclub.github.io/tutorials/topic-modelling-python/
    """
    Remove web links from a given text
    
    Parameters
    ----------
    tweet : string
        Text to be stripped of web links

    Returns
    -------
    tweet : string
        Text without webs links
    """
    tweet = re.sub(r'http\S+', '', tweet) # remove http links
    tweet = re.sub(r'bit.ly/\S+', '', tweet) # remove bitly links
    return tweet

def remove_users(tweet):
    # Reference: https://ourcodingclub.github.io/tutorials/topic-modelling-python/
    """
    Remove user account information and retweet tags from a given text
    
    Parameters
    ----------
    tweet : string
        Text to be stripped of web links

    Returns
    -------
    tweet : string
        Text user information and retweet tags
    """
    tweet = re.sub('(RT\s@[A-Za-z]+[A-Za-z0-9-_]+)', '', tweet) # remove retweet
    tweet = re.sub('(@[A-Za-z]+[A-Za-z0-9-_]+)', '', tweet) # remove tweeted at
    return tweet

def clean_tweet(tweet):
    # Reference: https://ourcodingclub.github.io/tutorials/topic-modelling-python/
    """
    Transform a given text into lowercase format, remove any user account
    information, retweet tags, and special characters.
    
    Parameters
    ----------
    tweet : string
        Text to be cleaned and formatted

    Returns
    -------
    tweet : string
        Cleaned text
    """
    tweet = remove_users(tweet)
    tweet = remove_links(tweet)
    tweet = re.sub(r'[^a-zA-Z]', ' ', tweet.lower()) # lowercase letters
    tweet = re.sub(fr'[{punctuations}]+', ' ', tweet) # strip punctuation
    tweet = re.sub('\s+', ' ', tweet) #remove double spacing
    tweet_tokens = tweet.split(' ')  # tokenize on spaces
    tweet_tokens = [WordNetLemmatizer().lemmatize(w) for w in tweet_tokens
                        if w not in stop_words]

    tweet = ' '.join(tweet_tokens)
    return tweet

def text_remove_unicode(text):
    text = re.sub(
        r"(@\[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?",
        "", text)
    return text


class Lemmatizer:
    """Lemmatize text using WordNet"""

    def __init__(self):
        self.wnl = WordNetLemmatizer()

    def __call__(self, text):
        return [
            self.wnl.lemmatize(word)
            for word
            in re.findall(r"(?u)(?<!<)[a-z]{2,}", text)
        ]

# define tagalog stop words and bad words [R]
tl_stopwords= ["pldt","pldtcares","pldthome","akin","aking","ako","alin",
               "am","amin","aming","ang","ano","anumang","apat","at","atin",
               "ating","ay","bababa","bago","bakit","bawat","bilang","dahil",
               "dalawa","dapat","din","dito","doon","gagawin","gayunman",
               "ginagawa","ginawa","ginawang","gumawa","gusto","habang",
               "hanggang","hindi","huwag","iba","ibaba","ibabaw","ibig",
               "ikaw","ilagay","ilalim","ilan","inyong","isa","isang",
               "itaas","ito","iyo","iyon","iyong","ka","kahit","kailangan",
               "kailanman","kami","kanila","kanilang","kanino","kanya",
               "kanyang","kapag","kapwa","karamihan","katiyakan","katulad",
               "kaya","kaysa","ko","kong","kulang","kumuha","kung","laban",
               "lahat","lamang","likod","lima","maaari","maaaring","maging",
               "mahusay","makita","marami","marapat","masyado","may",
               "mayroon","mga","minsan","mismo","mula","muli","na",
               "nabanggit","naging","nagkaroon","nais","nakita","namin",
               "napaka","narito","nasaan","ng","ngayon","ni","nila","nilang",
               "nito","niya","niyang","noon","o","pa","paano","pababa",
               "paggawa","pagitan","pagkakaroon","pagkatapos","palabas",
               "pamamagitan","panahon","pangalawa","para","paraan","pareho",
               "pataas","pero","pumunta","pumupunta","sa","saan","sabi",
               "sabihin","sarili","sila","sino","siya","tatlo","tayo",
               "tulad","tungkol","una","walang", "nyo", "niyo", "naman",
               "mo", "pls", "po", "kayo", "ba", "hi", "hello", "wala", "u",
               "nung", "nang", "kami", "kmi", "amp", "beh", "rin", "din",
               "jusko", "ha", "g", "kasi", "lang", "pi", "nadin", "narin",
               "e", "eh", "nga", "hey", "huy", "kayong", "nag", "paki", "pls"]

tl_badwords = ["amputa","animal ka","bilat","binibrocha","bobo","bogo",
               "boto","brocha","burat","bwesit","bwisit","demonyo ka",
               "engot","etits","gaga","gagi","gago","habal","hayop ka",
               "hayup","hinampak","hinayupak","hindot","hindutan","hudas",
               "iniyot","inutel","inutil","iyot","kagaguhan","kagang",
               "kantot","kantotan","kantut","kantutan","kaululan","kayat",
               "kiki","kikinginamo","kingina","kupal","leche","leching",
               "lechugas","lintik","nakakaburat","nimal","ogag","olok",
               "pakingshet","pakshet","pakyu","pesteng yawa","poke","poki",
               "pokpok","poyet","pu'keng","pucha","puchanggala","puchangina",
               "puke","puki","pukinangina","puking","punyeta","puta","putang",
               "putang ina","putangina","putanginamo","putaragis","putragis",
               "puyet","ratbu","shunga","sira ulo","siraulo","suso","susu",
               "tae","taena","tamod","tanga","tangina","taragis","tarantado",
               "tete","teti","timang","tinil","tite","titi","tungaw","ulol",
               "ulul","ungas", "yawa"]

stop_words = stopwords.words('english') + tl_stopwords + tl_badwords
punctuations = '!"$%&\'()*+,-./:;<=>?[\\]^_`{|}~•@'

exclude_words = stopwords.words(
    'english') + tl_stopwords + tl_badwords + ['twitter', 'account', 'official', 'new']

Visualization

In [17]:
def plot_wordcloud(data_dict, title):
    """
    Plot a word cloud of a set of input words

    Parameters
    ----------
    data_dict : dict or iterable
        Dictionary mapping words to frequencies, or an iterable of words
        to be counted
        
    title: str
        Title of the figure to plot
    """
    c = Counter(data_dict)
    res = {key: val for key, val in sorted(c.items(), key = lambda ele: ele[1], reverse=True)}
    
    mask_img = np.array(Image.open('dove_img.png'))
    wordcloud = (WordCloud(background_color ='white', colormap='Blues_r', #'gist_heat'
                           width=1500, height=800, mask=mask_img,
                           collocations=False, random_state=randstate)
                           .generate_from_frequencies(res))
    plt.figure()
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis('off')
    plt.title(f'Fig. {fig_n}: {title}', fontsize=15)
#     plt.savefig(f'wc_{title}.png', dpi=150, bbox_inches='tight')
    _ = fig_count()
    plt.show()

VII.B. Model Pipeline


Figure 7. The Model Pipeline.
The influencers were clustered based on their profile descriptions. For each cluster, the item profiles were used along with Dove's ratings to create a user profile. This then served as the basis for the Content-Based recommendations per cluster. Afterwards, the performance of the system was evaluated.


In [18]:
_ = fig_count()

Clustering

 When recommending influencers, it is essential to consider the specific characteristics and interests of Dove's target audience. By clustering Twitter users prior to employing recommender systems, we can identify groups of influencers with similar demographic characteristics and interests. This allows us to provide specialized recommendations for each cluster based on Dove's specific needs and preferences.


A. Influencer Profile Description

 In identifying the natural clustering of influencers, features such as their activity, location, follower ratio, tenure, and keywords used in their bio were considered. Their Twitter bio descriptions were further reduced to keywords through TF-IDF vectorization and lemmatization.

In [19]:
def df_for_clustering(df, to_drop=True):
    """
    Function that preprocesses features necessary for clustering
    """
    def filter_char(c): return ord(c) < 256
    data = df.copy()
    data['description'] = (
        data['description'].str.lower()
        .apply(lambda s: ''.join(filter(filter_char, s)))
        .apply(text_remove_unicode)
        .apply(lambda x: ' '.join(re.sub('[^A-Za-z]+', '', w)
                                  for w in nltk.word_tokenize(x)))
        .apply(lambda x: re.sub(' +', ' ', x))
        .apply(lambda x: ' '.join(w for w in x.split()
                                  if w not in exclude_words))
    )
    data['tenure'] = 2023 - pd.to_datetime(data.created_at).dt.year
    data['has_location'] = np.where(data.location.isna(), 0, 1)
    data['follower_ratio'] = data.followers_count/data.following_count
    data['follower_ratio'] = (data.follower_ratio.replace(np.inf, np.nan)
                              .fillna(data.followers_count))

    tfidfvectorizer = TfidfVectorizer(
        token_pattern=None,
        tokenizer=Lemmatizer(),
        stop_words=stop_words+exclude_words+["u", "im", "dont"],
        max_df=0.7,
        min_df=0.01
    ).fit(data.description)

    data_descrip = pd.DataFrame(
        tfidfvectorizer.transform(data.description).todense(),
        columns=tfidfvectorizer.get_feature_names_out(),
        index=range(1, len(data.description)+1)
    ).reset_index(drop=True)

    final = (pd.concat([data,
                       data_descrip], axis=1)
             .set_index(['id', 'username', 'name']))

    if to_drop:
        final.drop(columns=['followers_count', 'following_count',
                            'description', 'created_at', 'location',
                            'protected'],
                   inplace=True)

    final[['tweet_count',
           'listed_count',
           'tenure',
           'follower_ratio']] = (StandardScaler().fit_transform(
               final[['tweet_count', 'listed_count',
                      'tenure', 'follower_ratio']])
    )

    return final
In [20]:
# Read the table that contains the features for clustering, and generate
df_ = (pd.read_sql("""SELECT * FROM partners""", conn)
      .query('protected==0', engine='python')
      .reset_index(drop=True)
      .drop(columns=['url', 'included'])
     )
df_clean = df_for_clustering(df_)
/home/msds2023/mmenorca/.local/lib/python3.9/site-packages/sklearn/feature_extraction/text.py:409: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['animal', 'demonyo', 'doe', 'hayop', 'ina', 'keng', 'pesteng', 'pu', 'sira', 'ulo', 'wa'] not in stop_words.
  warnings.warn(

B. Truncated Singular Value Decomposition (t-SVD)

 The keyword extraction from Twitter bios resulted in a high-dimensional and sparse matrix. For this reason, dimensionality reduction via t-SVD was performed prior to feeding the features into the clustering algorithm.

In [21]:
def truncated_svd(X):
    """
    Function that accepts the design matrix and returns q, sigma, p and the 
    normalized sum of squared distance from the origin
    """

    q, s, p = np.linalg.svd(X, full_matrices=False)
    Q = q
    S = np.diag(s)
    P = p.T
    NSSD = (s / np.sqrt(np.sum(s**2)))**2

    return Q, S, P, NSSD


def min_svs(df):
    """
    Function to get the minimum number of singular vectors that explain 
    at least 75% of the variance
    """
    q, s, p, nssd = truncated_svd(df)
    nssd_cumsum = nssd.cumsum()
    return np.argwhere(nssd_cumsum >= 0.75)[0][0]+1
In [22]:
# Dimension reduction process
q_, s_, p_, nssd_ = truncated_svd(df_clean)
svd_ = TruncatedSVD(n_components=min_svs(df_clean),
                    random_state=1337,
                    algorithm='arpack')
df_svd = svd_.fit_transform(df_clean.astype(float))

C. $k$-Medoids

In this study, we used the $k$-medoids clustering algorithm since it is more robust to outliers, can handle a mix of numeric and categorical data, and chooses actual data points as centers in each iteration. Other clustering methods were also explored, such as agglomerative clustering, where the resulting clusters were similar to those of $k$-medoids, and density-based algorithms, where only one cluster was identified. Implementations of these algorithms can be found in the supplementary notebook other_clustering_methods.ipynb.
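As a rough illustration of the medoid-update idea, here is a simplified PAM-style sketch in plain NumPy (not the `pyclustering` implementation used below; initialization is naively fixed to the first $k$ points, and the data are made up):

```python
import numpy as np

def kmedoids_toy(X, k, n_iter=10):
    """Alternate between (1) assigning each point to its nearest medoid
    and (2) re-picking each cluster's medoid as the member that minimizes
    the total within-cluster distance."""
    medoid_idx = np.arange(k)  # naive init: first k points
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None] - X[medoid_idx][None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if len(members):
                within = np.linalg.norm(
                    X[members][:, None] - X[members][None], axis=2).sum(1)
                medoid_idx[j] = members[within.argmin()]
    return labels, medoid_idx

X = np.array([[0., 0.], [0., 1.], [1., 0.],        # tight group
              [10., 10.], [10., 11.], [11., 10.],  # second group
              [50., 50.]])                         # outlier
labels, medoids = kmedoids_toy(X, 2)
# the medoids are indices of actual data points, not averaged centroids
print(labels, medoids)
```

Unlike $k$-means, the outlier cannot drag a center off the data: each cluster center remains an actual observation.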

In [23]:
def cluster_range_kmedoids(X, k_start, k_stop):
    """
    Function that accepts the design matrix and the initial and final 
    values of k to step through. It returns a dictionary of the cluster 
    labels, cluster centers, and internal validation values for every k.
    """

    ys = []
    cs = []
    inertias = []
    chs = []
    scs = []
    dbs = []

    X = np.asarray(X)

    for k in trange(k_start, k_stop+1):
        clusterer_k = kmedoids(X, np.arange(k), ccore=True)
        clusterer_k.process()
        clusters = clusterer_k.get_clusters()
        y_pred = np.zeros(len(X), dtype=int)
        for cluster, point in enumerate(clusters):
            y_pred[point] = cluster
        centers = X[clusterer_k.get_medoids()]

        ys.append(y_pred)
        cs.append(centers)

        res_dict = dict(zip(['ys', 'centers'], [ys, cs]))

        # internal validation metrics
        sse = np.sum([euclidean(x, c) ** 2 for i, c
                      in enumerate(centers) for x in X[y_pred == i]])
        inertias.append(sse)  # sum of squared distances to medoids
        chs.append(calinski_harabasz_score(X, y_pred))  # Calinski-Harabasz
        scs.append(silhouette_score(X, y_pred))  # Silhouette score
        dbs.append(davies_bouldin_score(X, y_pred))  # Davies-Bouldin

        keys = ['inertias', 'chs', 'scs', 'dbs']
        values = [inertias, chs, scs, dbs]
        internal_dict = dict(zip(keys, values))

    res = {**res_dict, **internal_dict}

    return res
In [24]:
# Perform k-medoids clustering using k values from 2 to 10
res_kmedoid = cluster_range_kmedoids(df_svd, 2, 10)
100%|██████████| 9/9 [00:08<00:00,  1.01it/s]

D. Internal Validation Criteria

 To determine the optimal number of clusters in $k$-medoids, the Silhouette Coefficient (SC), Calinski-Harabasz (CH) score, and Davies-Bouldin (DB) score were used. High values of SC and CH and low values of DB are desired. Based on these internal validation metrics, the optimal number of clusters is 4. However, the last cluster contains only one data point (the Twitter user Taylor Swift); hence, we drop it and retain only 3 final clusters.
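All three criteria are available in scikit-learn; a small sanity check on synthetic data (using $k$-means for brevity rather than $k$-medoids, with made-up blobs standing in for the influencer features) shows them agreeing on the true number of groups:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import (calinski_harabasz_score, silhouette_score,
                             davies_bouldin_score)

# three well-separated synthetic blobs
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

sc = {}
for k in (2, 3, 4):
    y = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    sc[k] = silhouette_score(X, y)
    print(f'k={k}  SC={sc[k]:.2f} (high good)  '
          f'CH={calinski_harabasz_score(X, y):.0f} (high good)  '
          f'DB={davies_bouldin_score(X, y):.2f} (low good)')
```

On this toy data the silhouette coefficient peaks at the true $k=3$, mirroring how the optimal $k$ was read off the plots below.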

In [25]:
def plot_internal(chs, scs, dbs):
    """Plot internal validation values"""
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    ks = np.arange(2, len(chs)+2)
    axes[0].plot(ks, chs, '-ro', label='CH')
    axes[0].set_xlabel('$k$')
    axes[0].set_ylabel('CH')
    axes[1].plot(ks, scs, '-ko', label='Silhouette coefficient')
    axes[1].set_xlabel('$k$')
    axes[1].set_ylabel('Silhouette')
    axes[2].plot(ks, dbs, '-gs', label='DB')
    axes[2].set_xlabel('$k$')
    axes[2].set_ylabel('DB')
    plt.subplots_adjust(wspace=0.4)
    plt.title(f'Fig. {fig_n}: Clustering Internal Validation Criteria', fontsize=16)
    _ = fig_count()
    return axes


def plot_clusters(X, ys, centers, transformer):
    """Plot clusters given the design matrix and cluster labels"""
    k_max = len(ys) + 1
    k_mid = k_max//2 + 2
    fig, ax = plt.subplots(2, k_max//2, dpi=150, sharex=True, sharey=True,
                           figsize=(7, 4),  # subplot_kw=dict(aspect='equal'),
                           gridspec_kw=dict(wspace=0.01))
    for k, y, cs in zip(range(2, k_max+1), ys, centers):
        centroids_new = transformer.transform(cs)
        if k < k_mid:
            ax[0][k % k_mid-2].scatter(*zip(*X), c=y, s=1, alpha=0.8)
            ax[0][k % k_mid-2].scatter(
                centroids_new[:, 0],
                centroids_new[:, 1],
                s=10,
                c=range(int(max(y)) + 1),
                marker='s',
                ec='k',
                lw=1
            )
            ax[0][k % k_mid-2].set_title('$k=%d$' % k)
        else:
            ax[1][k % k_mid].scatter(*zip(*X), c=y, s=1, alpha=0.8)
            ax[1][k % k_mid].scatter(
                centroids_new[:, 0],
                centroids_new[:, 1],
                s=10,
                c=range(int(max(y))+1),
                marker='s',
                ec='k',
                lw=1
            )
            ax[1][k % k_mid].set_title('$k=%d$' % k)
    fig.suptitle(f'Fig. {fig_n}: $k-$medoids clusters', fontsize=16)
    _ = fig_count()
    return ax

def plot_describe_cluster(df, cluster):
    """
    Function that outputs analysis per cluster via wordcloud and radar plots
    """
    dict_clustname = {1: 'News/Entertainment Outlets', 2: 'Celebrities',
                      3: 'Social Media Macro & Micro Influencers'}
    dict_color_radar = {1: '#feae02', 2: '#888888', 3: '#1c43c9'}
    dict_color_cloud = {1: 'Wistia', 2:'gray', 3:'Blues'}

    fig, ax = plt.subplots(1, 1, figsize=(10, 6), dpi=100)

    # Wordcloud
    text = ' '.join([word for word in
                     set(df[(df.cluster == cluster)]
                         ['description'])
                     if word not in exclude_words])
    mask_img = np.array(Image.open('dove_img.png'))
    wordcloud = WordCloud(background_color='white',
                          collocations=False,
                          mask=mask_img,
                          colormap=dict_color_cloud[cluster]).generate(text)
    ax.imshow(wordcloud, interpolation="bilinear")
    plt.title(
        f'Fig. {fig_n}: Cluster {cluster}: {dict_clustname[cluster]}, {df[(df.cluster == cluster)].shape[0]} Twitter accounts')
    _ = fig_count()
    ax.axis('off')
    plt.show()
    
    # Radar Plot
    df_med = (df
              [['followers_count', 'following_count', 'tweet_count',
                  'listed_count', 'tenure', 'cluster']]
              .groupby('cluster').agg('median')
              )
    df_med.iloc[:, :] = MinMaxScaler().fit_transform(df_med)

    categories = df_med.columns.tolist()
    categories = [*categories, categories[0]]

    fig = go.Figure()

    r_ = df_med.iloc[cluster-1].values[0:].tolist()
    r_ = [*r_, r_[0]]
    fig.add_trace(go.Scatterpolar(
        r=r_,
        theta=categories,
        fill='toself',
        name=str(df_med.index[cluster-1]),
        line_color=dict_color_radar[df_med.index[cluster-1]],
        opacity=0.7
    ))

    fig.update_layout(template=None, plot_bgcolor="rgba(0,0,0,0)",
                      paper_bgcolor="rgba(0,0,0,0)",
                      polar=dict(radialaxis=dict(angle=90,
                                                 tick0=1,
                                                 dtick=0.5,
                                                 range=[-1, 1.45],
                                                 tickangle=90,
                                                 titlefont={"size": 15, }),
                                 angularaxis=dict(rotation=162,
                                                  tickfont={"size": 15})),
                      showlegend=False)
    return fig
In [26]:
plot_internal(res_kmedoid['chs'],
              res_kmedoid['scs'], res_kmedoid['dbs']);

Content-Based Recommendation System


A. Item profiles

 For each influencer, 20 tweets were randomly sampled and used to create an item profile. These tweets then underwent the text pre-processing pipeline described in section VII.A. The resulting item profile is a matrix with influencers as rows, the keywords of their tweets as columns, and the term frequency-inverse document frequency (TF-IDF) of the keywords as values.

In [27]:
tweet_count = 20
d_partner_tweets = {}

# Randomly sample 20 tweets per partner
for author, k in df_partner_tweets.groupby('author_id'):
    try:
        k_sample = k.sample(n=tweet_count) #, random_state=randstate
        d_partner_tweets[author] = ' '.join(k_sample.text)
    except ValueError:
        d_partner_tweets[author] = ' '.join(k.text)
In [28]:
# Clean the sample tweets
df_partner_tweets_s = (pd.DataFrame({'text': d_partner_tweets})
                      .reset_index().rename(columns={'index': 'author_id'}))
df_partner_tweets_s['clean_text'] = df_partner_tweets_s.apply(lambda x: clean_tweet(x.text), axis=1)

# Vectorize
tfidfvectorizer = TfidfVectorizer(
                        token_pattern=r'[a-z-]+', 
                        max_df=0.7,
                        min_df=0.05
                    ).fit(df_partner_tweets_s.clean_text)
In [29]:
# Create the item profiles of each influencer
df_item_profiles = pd.DataFrame(
                        tfidfvectorizer.transform(df_partner_tweets_s.clean_text).todense(),
                        columns=tfidfvectorizer.get_feature_names_out(),
                        index=range(1, len(df_partner_tweets_s.clean_text)+1)
                    )

B. User Profile

User Rating

 In the absence of explicit user ratings, a weighted score was created as a measure of how much Dove "likes" a particular influencer. As shown in equation (\ref{eq:weighted_score}), this weighted score $w_r$ considers whether the influencer is within the brand's social circle (i.e., whether Dove follows them), the influencer's average tweets per year, and their number of followers.

\begin{equation} w_r = \beta_r \times \left( 0.60 \times f_r + 0.40 \times t_{r, avg} \right) \tag{1} \label{eq:weighted_score} \end{equation} \begin{equation} u_r = \frac{cf_{b, w_r} + 0.5f_{w_r}}{N} \times 100 \tag{2} \label{eq:user_rating} \end{equation}

 For each influencer $r$, $\beta_r$ is the binary rating given by Dove such that the value is 1 if Dove follows them and 0 otherwise, $f_r$ is its number of followers, and $t_{r, avg}$ is its average tweets per year. Note that $w_r$ will be equal to 0 if Dove does not follow the influencer.

 The final user rating, $u_r$ is the percentile rank of each influencer based on $w_r$ as shown in equation (\ref{eq:user_rating}). $cf_{b, w_r}$ is the Cumulative Frequency below $w_r$, $f_{w_r}$ is the frequency of the rating, and $N$ is the total number of influencers observed.
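A worked example of equations (1) and (2) with hypothetical influencers, using pandas' `rank(pct=True)` as the percentile rank. (In the notebook itself, unfollowed accounts carry a null rating and are excluded from the ranking; here the zero row is ranked too, purely for illustration.)

```python
import pandas as pd

# hypothetical influencers: followed-by-Dove flag, followers, avg tweets/yr
df = pd.DataFrame({
    'beta':      [1,      1,     0,     1],
    'followers': [50_000, 2_000, 9_999, 800],
    't_avg':     [1_000,  5_000, 50,    300],
})

# Eq. (1): beta zeroes out the score of accounts Dove does not follow
df['w_r'] = df.beta * (0.60 * df.followers + 0.40 * df.t_avg)

# Eq. (2): the percentile rank of w_r becomes the user rating
df['u_r'] = df.w_r.rank(pct=True) * 100
print(df[['w_r', 'u_r']])
# w_r: 30400, 3200, 0, 600  ->  u_r: 100, 75, 25, 50
```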

Aggregated User Profile

 The mean-centered user ratings $u_r$ were used as weights on each influencer's item profile. The resulting user profile is then the weighted average of the item profiles of the rated influencers.
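With made-up numbers (two keywords, three rated influencers), the aggregation looks like this; mean-centring turns above-average ratings into positive weights and below-average ratings into negative ones:

```python
import numpy as np
import pandas as pd

# hypothetical TF-IDF item profiles: 3 rated influencers x 2 keywords
items = pd.DataFrame([[0.9, 0.1],
                      [0.2, 0.8],
                      [0.5, 0.5]], columns=['beauty', 'news'])
ratings = pd.Series([90., 30., 60.])   # Dove's u_r per influencer

weights = ratings - ratings.mean()     # mean-centred: [30, -30, 0]
user_profile = (items * weights.to_numpy()[:, np.newaxis]).mean()
print(user_profile)
# beauty ~7.0, news ~-7.0: the profile leans toward beauty keywords
```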

In [30]:
# Add derived variables
df_partners['tenure'] = pd.Timestamp.now().year - df_partners.created_at.dt.year
df_partners['avg_tweets_per_year'] = df_partners.tweet_count / df_partners.tenure

# Estimate a rating based on followers and tweet count
rated = df_partners.loc[df_partners.rating == 1]
df_partners['w_rating'] = (df_partners.loc[rated.index]
                                .apply(lambda x: (x.followers_count*0.60)
                                               + (x.avg_tweets_per_year*0.40),
                                                 axis=1))
df_partners['w_rating_pct'] = (df_partners.loc[rated.index]
                               .w_rating.rank(pct=True, ascending=True))
In [31]:
def compute_user_profile_agg_numeric(df_utility, df_item_profiles):
    """
    Return the profile of a given user with numeric ratings.
    
    Parameters
    ----------
    df_utility : pandas DataFrame
        Utility matrix
    df_item_profiles : pandas DataFrame
        Item profiles
        
    Returns
    -------
    pandas Series
        Numeric user profile
    """
    weights = (df_utility - df_utility.mean())
    return (df_item_profiles * weights.to_numpy()[:, np.newaxis]).mean()

C. Recommendation

$$d_\text{cos}(\vec v_1, \vec v_2) = 1 - S_\text{cos}(\vec v_1, \vec v_2) = 1 - \frac{\vec v_1 \cdot \vec v_2}{\left|\left|\vec v_1\right|\right| \left|\left| \vec v_2 \right|\right|} \tag{3} \label{eq:cos_dist}$$

 The Cosine distance [2] given by equation (\ref{eq:cos_dist}) between the item profile of each influencer, $\vec v_1$ and the user profile, $\vec v_2$ was calculated to determine which influencers have "similar" tone, ideals, and values as the brand, Dove.

  For each cluster, $L = 40$ influencers were initially chosen for review, then 5 were shortlisted to present to Dove as a recommendation.
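As a quick check of equation (3) with made-up vectors (scipy's `cosine` already returns the distance, i.e., one minus the similarity):

```python
import numpy as np
from scipy.spatial.distance import cosine

user_profile = np.array([7.0, -7.0, 0.0])   # hypothetical user profile
item_a = np.array([0.9, 0.1, 0.0])          # beauty-leaning account
item_b = np.array([0.1, 0.9, 0.0])          # news-leaning account

print(cosine(user_profile, item_a))  # ~0.38: closer, recommended earlier
print(cosine(user_profile, item_b))  # ~1.62: pointing the opposite way
```

Because recommendations are sorted by ascending distance, `item_a` would outrank `item_b` in the list returned to Dove.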

In [32]:
def recommend_agg(df_utility, df_item_profiles, user_profile, L):
    """
    Return a list of recommended unrated items to a user, sorted
    from most recommended to least, then by item ID.
    
    Parameters
    ----------
    df_utility : pandas DataFrame
        Utility matrix
    df_item_profiles : pandas DataFrame
        Item profiles
    user_profile : pandas Series
        User profile
    L : int
        Number of recommendations
        
    Returns
    -------
    list
        IDs of recommended items
    """
    unrated_idx = df_utility[df_utility.isnull()].index
    s_reco = (df_item_profiles.loc[unrated_idx]
                              .apply(lambda x: cosine(user_profile, x), axis=1))
    d_reco = sorted(s_reco[s_reco > 0].to_dict().items(), key=lambda x: (x[1], x[0]))
    return list(list(zip(*d_reco))[0][:L])

D. Performance Evaluation

 The performance of the recommender system was evaluated based on how useful the recommendations are to the brand. In this case, a recommendation is deemed useful if the more important or similar influencers appear first in the list.

 A popular way of measuring this is via the discounted cumulative gain (DCG) and normalized discounted cumulative gain (NDCG) [8]. These are defined as

$$ DCG = \frac{1}{m} \sum_{u=1}^m \sum_{j \in I_u, v_j \leq L} \frac{2^{r_{uj}}}{\log_2 \left(v_j +1\right)}; \\ NDCG = \frac{DCG}{IDCG}, $$

 where $L$ is the number of recommended items, $v_j \in \{1, \ldots, L\}$ is the rank of item $j$ in the recommendation, $r_{uj}$ is the actual rating of user $u$ on item $j$, and $I_u$ is the set of items rated by user $u$. IDCG is the idealized DCG, i.e., the DCG obtained when the sorting follows the ground-truth rankings.

 The values of NDCG range from 0 to 1, with 1 indicating the best performance.
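A quick check with `sklearn.metrics.ndcg_score` on made-up relevances (note that scikit-learn's default gain is linear in the relevance rather than the $2^{r_{uj}}$ form above, but the 0-to-1 normalization behaves the same way):

```python
import numpy as np
from sklearn.metrics import ndcg_score

# ground-truth relevances of 4 items, and two candidate score vectors
true_rel  = np.array([[3, 2, 1, 0]])
perfect   = np.array([[4, 3, 2, 1]])   # same ordering as the truth
reversed_ = np.array([[1, 2, 3, 4]])   # worst possible ordering

print(ndcg_score(true_rel, perfect))    # 1.0: ideal ranking
print(ndcg_score(true_rel, reversed_))  # < 1.0: penalized ordering
```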

In [33]:
def evaluate_recsys(sim_size, N, df_utility, df_item_profiles):
    """
    Return the Discounted Cumulative Gain and Normalized Discounted
    Cumulative Gain of a recommender system.
    
    Parameters
    ----------
    sim_size : int
        Number of simulations or trials
    N : int
        Number of recommendations to be made per trial
    df_utility : pandas DataFrame
        Utility matrix
    df_item_profiles : pandas DataFrame
        Item profiles
        
    Returns
    -------
    dcg : float
        Discounted Cumulative Gain
    ndcg : float
        Normalized Discounted Cumulative Gain
    """
    dcg = []
    ndcg = []
    
    np.random.seed(randstate)
    for s in range(sim_size):
        rated = df_utility.dropna()
        test_idx = np.random.choice(rated.index, size=N, replace=False)
        rated[test_idx] = np.nan

        eval_item_profiles = df_item_profiles.loc[rated.index]
        eval_user_profile = compute_user_profile_agg_numeric(rated, eval_item_profiles)    

        y_pred = recommend_agg(rated, eval_item_profiles, eval_user_profile, N)
        y_true = (df_utility[y_pred].sort_values(ascending=False, kind='mergesort')
                                    .index.tolist())

        pred_rel = {y_pred[r]: len(y_pred)-r for r in range(len(y_pred))}
        true_rel = {y_true[r]: len(y_true)-r for r in range(len(y_true))}

        df_rel = (pd.DataFrame([pred_rel, true_rel],
                               index=['pred_rel', 'true_rel']).T)

        dcg.append(dcg_score([df_rel.true_rel.tolist()], [df_rel.pred_rel.tolist()]))
        ndcg.append(ndcg_score([df_rel.true_rel.tolist()], [df_rel.pred_rel.tolist()]))
        
    return dcg, ndcg

VIII. Results and Discussion


VIII.A. Clustering

 The potential partners of Dove were clustered into three groups based on the results of the internal validation metrics, namely News & Entertainment Outlets, Celebrities, and Social Media Macro & Micro Influencers.

In [34]:
kmo = kmedoids(np.asarray(df_svd), np.arange(4), ccore=True)
kmo.process()
clusters = kmo.get_clusters()
y_kmedoid = np.zeros(len(df_svd))
for cluster, point in enumerate(clusters):
    y_kmedoid[point] = cluster

df_clusters = df_for_clustering(df_, to_drop=False)
df_clusters['cluster'] = np.int64(y_kmedoid)
df_clusters.cluster = df_clusters.cluster.map({2:-1, 0:1, 1:2, 3:3})

df_clusters = df_clusters[df_clusters.cluster > 0].reset_index()

plt.figure(dpi=100, figsize=(6, 4))
plt.scatter(df_svd[:, 0], df_svd[:, 1], c=y_kmedoid)
plt.title(f'Fig. {fig_n}: Final K-medoids clustering\n' \
                'retaining only 3 clusters', fontsize=14)
_ = fig_count()
plt.ylim([-5, 10])
plt.show()

Cluster 1: News & Entertainment Outlets.

  This cluster of potential partners is mostly composed of news and entertainment accounts. As observed in the word cloud, their Twitter bio descriptions mostly contain news-related keywords such as `breaking`, `latest`, and `stories`. In terms of quantitative features, these potential partners have relatively high follower-to-following ratios and tweet counts, and the longest tenure on the Twitter platform.

In [35]:
plotly.offline.init_notebook_mode()
plot_describe_cluster(df_clusters, 1)

Cluster 2: Celebrities.

 This cluster of potential partners is mostly composed of celebrity accounts. Its bio word cloud comprises keywords relating to official celebrity accounts, such as `instagram`, `singer`, and `host`. In terms of quantitative features, these potential partners have the lowest follower-to-following ratios and tweet counts, and the shortest tenure on the Twitter platform relative to the other clusters. This may mean that most of these accounts are not as active as the news outlets and social media influencers.

In [36]:
plot_describe_cluster(df_clusters, 2)

Cluster 3: Social Media Macro & Micro Influencers.

  This cluster of potential partners is mostly composed of influencers who rose to fame through social media platforms like YouTube, Instagram, and TikTok. It is worth noting that some accounts are also celebrities who may have been placed in this group due to similarities in their behavior to social media influencers. As observed in the word cloud, their Twitter bio descriptions mostly contain keywords such as `youtube`, `fashion`, and `travel`. In terms of quantitative features, these potential partners have relatively high following counts and long tenure.

In [37]:
plot_describe_cluster(df_clusters, 3)

VIII.B. Content-Based Recommendation System

 The recommendations made in this section assume that Dove PH aims to work with local companies and influencers only. Thus, international companies and influencers were excluded.

 Note that despite the exclusion of non-recommendable partners, the order in which the recommendable partners appeared was still followed.

In [38]:
# Merge to follow the index of the items
df_partners = (df_partner_tweets_s[['author_id']]
                   .merge(df_partners, left_on='author_id', right_on='id', how='inner')
                   .merge(df_clusters[['id', 'cluster']], on='id', how='left')
                   .drop('author_id', axis=1))
df_partners.index = (range(1, len(df_partner_tweets_s)+1))

# Replace 0 ratings with null
df_partners.rating.replace({0: np.nan}, inplace=True)

# Define the utility matrix
brand_utility = df_partners.w_rating_pct
In [39]:
d_user_profiles = {}
d_recos = {}
L = 40
for cluster, s in df_partners.groupby('cluster'):
    if cluster != -1: # outlier
        c_utility = brand_utility[s.index]
        c_items = df_item_profiles.loc[s.index]
        c_user_profile = compute_user_profile_agg_numeric(c_utility, c_items)
        d_user_profiles[cluster] = c_user_profile
        d_recos[cluster] = recommend_agg(c_utility, c_items, c_user_profile, L)

News/Entertainment Outlets

 If Dove wishes to partner with News/Entertainment outlets, they can consider ABS-CBN News (`ABSCBNNews`), SKY (`SKYserves`), Star Cinema (`StarCinema`), CNN (`CNN`), and SMART (`LiveSmart`) as potential partners. The defining characteristic of these influencers' content that matched Dove's is the presence of words such as thank, movie, help, free, message, and engage.

In [40]:
# News & Entertainment companies
c1_recos = df_partners.loc[d_recos[1]] 

# Assumption: Dove PH can only work with PH companies
c1_valid = [170, 128, 637, 825, 818]
c1_recos_ph = c1_recos.loc[c1_valid]
c1_recos_ph
Out[40]:
id description created_at username protected name url location followers_count following_count tweet_count listed_count included rating tenure avg_tweets_per_year w_rating w_rating_pct cluster
170 15872418 Stories, video, and multimedia for Filipinos w... 2008-08-16 10:09:33+00:00 ABSCBNNews 0 ABS-CBN News https://t.co/9yfQzguRRD Manila, Philippines 8917322 1080 1043998 8542 1.0 NaN 15 69599.866667 NaN NaN 1.0
128 150165941 2010-05-31 07:28:03+00:00 SKYserves 0 SKYserves https://t.co/dEHArWO9y1 Philippines 134029 12830 622443 188 1.0 NaN 13 47880.230769 NaN NaN 1.0
637 39956328 This is the OFFICIAL Twitter account of Star C... 2009-05-14 08:37:24+00:00 StarCinema 0 Star Cinema https://t.co/2ksfh8SqTN Philippines 1936568 588 403436 661 1.0 NaN 14 28816.857143 NaN NaN 1.0
825 759251 It’s our job to #GoThere & tell the most diffi... 2007-02-09 00:35:02+00:00 CNN 0 CNN https://t.co/imGp4Ieixi None 61258353 1095 399299 157950 1.0 NaN 16 24956.187500 NaN NaN 1.0
818 74409069 The official Twitter account of Smart Communic... 2009-09-15 09:38:04+00:00 LiveSmart 0 SMART https://t.co/2P0bZUPKVK Philippines 1525578 48843 414056 1420 1.0 NaN 14 29575.428571 NaN NaN 1.0
In [41]:
c1_recos_ph_tweets = df_partner_tweets_s.loc[df_partner_tweets_s.author_id.isin(c1_recos_ph.id)]
c1_tokens = np.concatenate(c1_recos_ph_tweets.clean_text.str.split().tolist())
plot_wordcloud(c1_tokens, 'Tweets of Recommended News/Entertainment Outlets')

Celebrities

 If Dove wishes to partner with Celebrities, they can consider Janine Gutierrez (`janinegutierrez`), Ylona Garcia (`ylona_garcia`), Xian Lim (`XianLimm`), Liza Soberano (`lizasoberano`), and Shamcey Supsup (`supsup_shamcey`) as potential partners. The defining characteristic of these influencers' content that matched Dove's is the presence of words such as thank, love, happy, watching, feel, and guys.

 Dove mostly partners with women to echo their campaigns; hence, Xian Lim can be considered a serendipitous result. This may be due to his recent vlog this year, 2023, which centered on "self-love." The vlog's topic may have been associated with "empowerment", one of the major values promoted by Dove.

 Other results include Janine Gutierrez, a National Youth Ambassador; Shamcey Supsup, a very influential Miss Universe candidate; and Liza Soberano, who rebranded herself just recently. All of them are relevant potential partners for the brand.

In [42]:
# Celebrities
c2_recos = df_partners.loc[d_recos[2]]

c2_valid = [405, 443, 448, 490, 489] 
c2_recos_ph = c2_recos.loc[c2_valid]
c2_recos_ph
Out[42]:
id description created_at username protected name url location followers_count following_count tweet_count listed_count included rating tenure avg_tweets_per_year w_rating w_rating_pct cluster
405 240611246 film. fashion. family. Philippines ✨ @wwfphili... 2011-01-20 09:40:33+00:00 janinegutierrez 0 JANINE https://t.co/KTIV1GjfYj Manila 226750 594 27249 87 1.0 NaN 12 2270.750000 NaN NaN 2.0
443 2613171312 /ee-lona/ • Wanderland - Mar 4 2014-07-09 08:34:15+00:00 ylona_garcia 0 ylona. https://t.co/LQMmKJmbvX Los Angeles, CA 700520 49 19569 97 1.0 NaN 9 2174.333333 NaN NaN 2.0
448 264058981 Artist/ Filmmaker/ Painter 2011-03-11 07:42:11+00:00 XianLimm 0 XIAN LIM None None 3053233 423 22648 1730 1.0 NaN 12 1887.333333 NaN NaN 2.0
490 284291853 Imperfection is beauty instagram: @lizasoberan... 2011-04-19 01:02:58+00:00 lizasoberano 0 Liza Soberano None None 4904453 173 14557 547 1.0 NaN 12 1213.083333 NaN NaN 2.0
489 283783357 No need to shout to be heard. Sometimes the be... 2011-04-18 01:09:33+00:00 supsup_shamcey 0 shamcey supsup lee None None 300110 91 898 0 1.0 NaN 12 74.833333 NaN NaN 2.0
In [43]:
c2_recos_ph_tweets = df_partner_tweets_s.loc[df_partner_tweets_s.author_id.isin(c2_recos_ph.id)]
c2_tokens = np.concatenate(c2_recos_ph_tweets.clean_text.str.split().tolist())
plot_wordcloud(c2_tokens, 'Tweets of Recommended Celebrities')

Social Media Micro- & Macro-Influencers

 If Dove wishes to partner with Social Media Micro- & Macro-Influencers, they can consider Tirso Cruz III (`tirsocruziii`), Dolly Carvajal (`dollyannec`), Sam Pinto (`SamPinto_`), Mikael Daez (`mikaeldaez`), and Paula Taylor (`paulataylor`) as potential partners. The defining characteristic of these influencers' content that matched Dove's is the presence of words such as photo, posted, boutique, resort, swipe, and get.

 For this cluster, the serendipitous influencer might be Tirso Cruz III. It is interesting to note that Tirso is a cancer survivor and has since been an advocate for cancer awareness. This form of empowerment might have led to his inclusion in the recommendations for this cluster. Other potential candidates include Dolly Carvajal, an entertainment columnist, and Sam Pinto, a newlywed and new mom.

In [44]:
# Social Media influencers
c3_recos = df_partners.loc[d_recos[3]]

# Choose the top 5 only
c3_recos_ph = c3_recos.head(5)
c3_recos_ph
Out[44]:
id description created_at username protected name url location followers_count following_count tweet_count listed_count included rating tenure avg_tweets_per_year w_rating w_rating_pct cluster
196 168221275 2010-07-18 19:03:13+00:00 tirsocruziii 0 Tirso Cruz III None None 125367 270 10450 106 1.0 NaN 13 803.846154 NaN NaN 3.0
66 129144705 Vodka Queen, MJ fanatic, proud single mom, hop... 2010-04-03 09:20:27+00:00 dollyannec 0 Dolly Anne Carvajal None on the edge:-) 53964 820 24268 132 1.0 NaN 13 1866.769231 NaN NaN 3.0
726 54866152 Facebook Page - hellosampinto • Instagram and ... 2009-07-08 11:15:31+00:00 SamPinto_ 0 Sam Pinto https://t.co/boI6v2rK9V Republic of the Philippines 1316879 313 30353 1732 1.0 NaN 14 2168.071429 NaN NaN 3.0
674 46081647 I guess i had a twitter account all along ;) h... 2009-06-10 10:24:56+00:00 mikaeldaez 0 Mikael Daez http://t.co/u3QCvSb4L3 Philippines 204286 490 18905 95 1.0 NaN 14 1350.357143 NaN NaN 3.0
666 44102517 2009-06-02 11:35:40+00:00 paulataylor 0 Paula Taylor http://t.co/86UQVtBi3M None 592520 99 8145 1356 1.0 NaN 14 581.785714 NaN NaN 3.0
In [45]:
c3_recos_ph_tweets = df_partner_tweets_s.loc[df_partner_tweets_s.author_id.isin(c3_recos_ph.id)]
c3_tokens = np.concatenate(c3_recos_ph_tweets.clean_text.str.split().tolist())
plot_wordcloud(c3_tokens, 'Tweets of Recommended Social Media\nMicro- & Macro-Influencers')

VIII.C. Performance Evaluation

 From 50 trials with 5 recommendations each, the average Discounted Cumulative Gain (DCG) and Normalized Discounted Cumulative Gain (NDCG) are 9.28 and 0.904, respectively.

 Given that the average NDCG is close to 1, we can conclude that the recommendations made by our system are relevant to and personalized for the brand.
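
 The averages above were produced by `evaluate_recsys`, whose internals are not shown in this section. As a hedged sketch, one common formulation of DCG and NDCG (assuming graded relevances and a log2 rank discount; the exponential-gain variant would differ) is:

```python
import numpy as np

def dcg_at_k(relevances):
    """Discounted Cumulative Gain: relevance discounted by log2 of rank."""
    relevances = np.asarray(relevances, dtype=float)
    ranks = np.arange(1, len(relevances) + 1)
    return float(np.sum(relevances / np.log2(ranks + 1)))

def ndcg_at_k(relevances):
    """NDCG: DCG divided by the ideal DCG (relevances sorted descending)."""
    ideal = dcg_at_k(sorted(relevances, reverse=True))
    return dcg_at_k(relevances) / ideal if ideal > 0 else 0.0

# Toy example: graded relevances of 5 recommendations, in ranked order
rels = [3, 2, 3, 0, 1]
print(round(dcg_at_k(rels), 4), round(ndcg_at_k(rels), 4))
```

 An NDCG of 1.0 means the recommendations are already in the ideal order; values near 1 over many trials indicate consistently well-ordered, relevant recommendations.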

In [46]:
# Evaluate the recommender system
sim_size = 50
N = 5

test_dcg, test_ndcg = evaluate_recsys(sim_size, N, brand_utility, df_item_profiles)
print(f'Avg. DCG: {np.mean(test_dcg):.4f} -- Avg. NDCG: {np.mean(test_ndcg):.4f}')

# Plot the evaluation results
plt.figure(figsize=(8, 4), dpi=100)
plt.plot(range(1, sim_size+1), test_ndcg, marker='o', label='NDCG', color='#003B7F')
plt.xlabel('Trial')
plt.ylabel('Normalized Discounted\nCumulative Gain (NDCG)')
plt.ylim(0.65, 1.05)
plt.title(f'Fig. {fig_n}: Recommender System Performance', fontsize=15)
_ = fig_count()
# plt.savefig(f'eval_ndcg.png', dpi=150, bbox_inches='tight');
Avg. DCG: 9.2845 -- Avg. NDCG: 0.9039

IX. Limitations & Future Work

 The work focused on only three social networks: Dove's, Anne Curtis', and Alden Richards'. This introduced a constraint on the potential influencers that the model could choose from.

 Given the absence of explicit ratings, user ratings were assumed to be a function of an influencer's follower count and activity. This proxy was chosen based on the use case.
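
 As a minimal sketch of such an implicit rating, the reach and activity columns from the influencer dataframe could be blended as percentile ranks. The 70/30 weights and the `implicit_rating` helper below are illustrative assumptions, not the project's actual formula:

```python
import pandas as pd

# Hypothetical proxy rating: blend reach (followers) and activity
# (avg. tweets per year) as percentile ranks. Column names mirror the
# notebook's dataframe; the weights are illustrative assumptions.
def implicit_rating(df, w_reach=0.7, w_activity=0.3):
    reach = df['followers_count'].rank(pct=True)         # percentile of reach
    activity = df['avg_tweets_per_year'].rank(pct=True)  # percentile of activity
    return w_reach * reach + w_activity * activity       # weighted blend in [0, 1]

demo = pd.DataFrame({
    'followers_count': [226750, 700520, 3053233],
    'avg_tweets_per_year': [2270.75, 2174.33, 1887.33],
})
print(implicit_rating(demo).round(3).tolist())
```

 Percentile ranks keep the two signals on a common [0, 1] scale so neither raw follower counts nor tweet volume dominates the blend.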

 Only 20 tweets per influencer were considered when creating the item profiles. These were assumed to be representative of the tone, values, or ideals of an influencer that could be matched with the brand's.

 Given these limitations, future improvements may include (a) expanding the data on influencers to obtain a more diverse sample, (b) firming up the text pre-processing pipeline to get better clustering and recommendations, (c) identifying the keywords that best describe an influencer through TF-IDF or Topic Modeling instead of randomly selecting 20 tweets, and (d) incorporating an explainability algorithm into the pipeline to better guide brands on which features dictate the result of the matching process the most.
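
 Future-work item (c) could be sketched as follows: treat each influencer's concatenated cleaned tweets as one document and surface their top TF-IDF terms. The toy documents and the `top_terms` helper below are hypothetical, and this uses the plain `tf * log(N/df)` weighting rather than any particular library's variant:

```python
import math
from collections import Counter

# Toy corpora: one "document" per influencer, formed by concatenating
# their cleaned tweets (hypothetical tokens for illustration).
docs = {
    'influencer_a': 'skincare routine glow serum skincare self love'.split(),
    'influencer_b': 'basketball game finals score basketball team'.split(),
}

n_docs = len(docs)
doc_freq = Counter()  # number of documents each term appears in
for tokens in docs.values():
    doc_freq.update(set(tokens))

def top_terms(tokens, k=3):
    """Return the k highest TF-IDF terms for one influencer's tokens."""
    tf = Counter(tokens)
    scores = {t: (c / len(tokens)) * math.log(n_docs / doc_freq[t])
              for t, c in tf.items()}
    return [t for t, _ in sorted(scores.items(), key=lambda x: -x[1])[:k]]

for name, tokens in docs.items():
    print(name, top_terms(tokens))
```

 Terms shared by every influencer get an IDF of zero and drop out, so the surviving keywords are the ones that distinguish each influencer, which is exactly what a content-based matcher needs.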


X. Conclusion

 Influencers can be categorized into three groups: (1) News and Entertainment outlets, (2) Celebrities, and (3) Social Media Micro- and Macro-influencers. Different brands and companies will have varying preferences on who to collaborate with, hinged on a few crucial factors: budget, intended reach or scope of influence, and whether or not the influencer's values match theirs.

 Depending on the goal, our Content-Based Recommendation system, with an average relevance score of 0.90, can guide a brand in choosing the right influencer to partner with in their marketing campaigns. An efficient recommender system can afford brands several advantages.

 An influencer whose values, audience, and interests align with the brand's target market ensures that efforts are targeted and the campaign's message reaches the right people. Influencer marketing can also be more cost-effective than traditional advertising methods, although high agency fees and commissions can be a limiting factor. A recommender system mitigates this by allowing brands to identify influencers with a suitable audience size and engagement rate that fit their budget.

 Depending on its goals, a business can also save man-hours that would otherwise be spent scouring social media sites for potential partners. With this system, the business is given a shortlist of candidates to start from that is more or less in line with its values.

 Conversely, this system can also be used by influencers to identify potential business partners to reach out to. This may give them business opportunities they might otherwise not have gotten.


XI. References

[1] De Jesus, S. (2016, December 5). Top 10 most followed Twitter accounts in PH for 2016. RAPPLER. https://www.rappler.com/technology/social-media/154622-top-10-most-followed-twitter-accounts-ph-2016/

[2] Alis, C. (2022). Information Retrieval and Searching by Similarity Part I.

[3] GeeksforGeeks. (2022, November 7). Python | Lemmatization Approaches with Examples. https://www.geeksforgeeks.org/python-lemmatization-approaches-with-examples/

[4] Srinidhi, S. (2021, December 13). Lemmatization in Natural Language Processing (NLP) and Machine Learning. Medium. https://towardsdatascience.com/lemmatization-in-natural-language-processing-nlp-and-machine-learning-a4416f69a7b6

[5] Bernal, D. E., Bundoc, S. F., Escalante, S. O., Guinto, J. A., Mann, J. D., Menorca, M. L. (2022). PLDT Anuna?: Topic Modeling of Customer Concerns to Streamline Service Support.

[6] GeeksforGeeks. (2023, January 19). Understanding TF-IDF | Term Frequency-Inverse Document Frequency. https://www.geeksforgeeks.org/understanding-tf-idf-term-frequency-inverse-document-frequency/

[7] Góralewicz, B. (2023, January 30). The TF*IDF Algorithm Explained. Onely. https://www.onely.com/blog/what-is-tf-idf/

[8] Alis, C. (2022). Content-based Recommender Systems.

[9] Nguyen, K. (2021, July 26). The Social Media Revolution: How Social Media Has Changed Marketing. BUSINESSNAV. https://businessnav.com/the-social-media-revolution-how-social-media-has-changed-marketing/

[10] Pec, T. (2022, September 6). Why Businesses And Brands Need To Be Taking Advantage Of Social Media. Forbes. https://www.forbes.com/sites/forbesagencycouncil/2022/09/06/why-businesses-and-brands-need-to-be-taking-advantage-of-social-media/?sh=255024e6216c

[11] Geyser, W. (2023, January 20). What is Influencer Marketing? – The Ultimate Guide for 2023. Influencer Marketing Hub. https://influencermarketinghub.com/influencer-marketing/#toc-0